web : sbc.shef.ac.uk
twitter: @SheffBioinfCore
email: bioinformatics-core@sheffield.ac.uk
An overall workflow for the processing and analysis of RNA-seq data is given in the image below from Ting-You Wang’s RNA-seq analysis page.
In this workshop we are going to concentrate on differential-expression and pathways analysis - which are the least computationally-intensive parts of the workflow and require the least Bioinformatics experience.
For those interested in alignment and QC steps, we have some materials available that use the Galaxy online resource
We will also have some courses on using the command-line and the nextflow workflow manager to process raw RNA-seq data.
However, regardless of whatever method you use to process the data, decisions that you make before commencing sequencing can have a huge impact on the results.
A short introduction to Next Generation Sequencing can be found in this youtube video
A good overview of RNA-seq analysis can be found here
We will now briefly describe the processes involving in turning raw sequencing data into the data that we will be using in this workshop.
Raw raw-seq data are delivered in the form of fastq files. These are large (typically several Gb) that contain information on the sequences that have been generated for each biological sample; one fastq (or pair of fastqs) for each sample. Each set of four lines describe one sequence (called a “read”).
A typical RNA-seq experiment will have 10 - 30 million reads in a fastq file, with each read about 100 bases long
@D0UW5ACXX120511:8:1204:6261:40047/1
AATGTTTATGTTCTTAAATTTTAGTTGTATATGTGAATCTTTGTAGTTTTTGCTAAAATACTAAGTAATTTATATAAAAGTGAGTTAAGAGATTTTTCTGA
+
CCCFFFFFHHHHHJJJJJIJJJJJIJJHIIJIJIJJIJJJIJJHIIHIJJJJJJBEGIHIJICGIDICFGIJJJIIJJGJ>F>GAGCGEEHEHHEEFFFD>
Quality assessment can be performed to see if the raw sequences are of sufficient quality for analysis
As the fastq files are large, we tend to analyse them using command-line software and a computing cluster
The traditional workflow for RNA-seq compares the sequences to a reference genome to see which genomic region each read matches the best.
Again, this requires more memory than a typical laptop or desktop machine so is performed on a remote computer with large memory. The resulting file is called a bam and records the best genomic match for each read. However, as we are interested in gene expression we want to relate these mappings to the positions of genes.
A variety of different counting methods can determine how many reads overlap each known gene region. These are know as the raw counts and are the kind of data we will start with today.
Recent tools for RNA-seq analysis (e.g. salmon,
kallisto) do not require the time-consuming step of
whole-genome alignment to be performed, and can therefore produce
gene-level counts in a much faster time frame. They not require the
creation of large bam files, which is useful if constrained by file
space on Galaxy.
(image from Harvard Bioinformatics Core)
Before embarking on any high-throughput experiment, it is important to pay due attention to the experimental design. This famous quote from the statistician R.A Fisher in the 1938 is still applicable to modern technologies
To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.
Experimental Design encompasses questions such as
When performing a high-throughput experiment, our measurements will be subject to biological variation (which we may be interested in) and technical variation (which we probably won’t be). Being able to control these factors and minimise biases is key to experimental design.
Confounding factors in our design may arise by accident, or might be caused by not considering all possible sources of variation:-
(image from Cancer Research
UK Cambridge Institute course on Experimental Design)
We can also introduce so-called “batch-effects” by our choice of when samples are prepared for sequencing. Large experiments may necessitate multiple runs or batches, and we should try and minimize the possible impact of batches but including a good representation of each condition of interest in each batch.
(image from Cancer Research UK
Cambridge Institute course on Experimental Design)
When planning next-generation sequencing experiments, you will also need to consider
Some recommendations on these questions and more are provided by the Cancer Research Uk Cambridge Institute Genomics Core. Often the sequencing vendor performing your experiment will have some default options available.
The vendor may not advise on the sample-size; how many
samples you will be sequencing to address your biological hypothesis of
interest. This is a complex question and is often influenced by
practical and financial constraints. The Sheffield Bioinformatics Core
is able to advise on this, and any of the other issues above.
bioinformatics-core@sheffield.ac.uk
A differential expression analysis requires two input files to be created.
The count matrix can be obtained by performing quantification (outside the scope of this workshop…). This will usually be generated for you by the sequencing vendor or Bioinformatics Core. The structure of the count matrix is shown below.
| gene | sampleA | sampleB |
|---|---|---|
| A | 1500 | 900 |
| B | 20 | 10 |
| … | … | … |
The gene named A was sequenced 1500 times on sampleA and 900 times on sampleB etc. These are referred to as raw counts, and cannot just put this numbers into a standard statistical test (e.g. t-test) to assess significance. There are several reasons for this.
The term differential expression was first used to refer to the process of finding statistically significant genes from a microarray gene expression study.
Such methods were developed on the premise that microarray expression values are approximately normally-distributed when appropriately transformed (e.g. by using a log\(_2\) transformation) so that a modified version of the standard t-test can be used. The same test is applied to each gene under investigation yielding a test statistic, fold-change and p-value. Similar methods have been adapted to RNA-seq data to account for the fact that the data are count-based and do not follow a normal distribution.
Degust is a web tool that can analyse counts files and
test for differential gene expression. It offers and interactive view of
the differential expression results and also sample quality
assessment.
R-based methods such as edgeR (implemented in Degust)
and DESeq2 have their own method of normalising counts. You
will probably encounter other methods of normalising RNA-seq reads such
as RPKM, CPM, TPM etc. This
blog provides a nice explanation of the current thinking. As part of
the Degust output, you have the option of downloading
normalised counts in various formats. Some other online visualisation
tools require normalised counts as input, so it is good to have these
to-hand.
We will use a previously-published count matrix. This was downloaded from the Gene Expression Omnibus (GEO) under the accession number GSE60450. Note that we have shortened the column headings and added gene symbols to help with visualisation and annotation
Download the counts from this link
GSE60450_Lactation-GenewiseCounts_rename_symbol.csv, and
click Open.| Run | Name | CellType | Status |
|---|---|---|---|
| SRR1552444 | MCL1-LA | basal | virgin |
| SRR1552445 | MCL1-LB | luminal | virgin |
| SRR1552446 | MCL1-LC | Luminal | pregnancy |
| SRR1552447 | MCL1-LD | Luminal | pregnancy |
| SRR1552448 | MCL1-LE | luminal | lactation |
| SRR1552449 | MCL1-LF | luminal | lactation |
| SRR1552450 | MCL1-DG | basal | virgin |
| SRR1552451 | MCL1-DH | luminal | virgin |
| SRR1552452 | MCL1-DI | basal | pregnancy |
| SRR1552453 | MCL1-DJ | basal | pregnancy |
| SRR1552454 | MCL1-DK | basal | lactation |
| SRR1552455 | MCL1-DL | basal | lactation |
(Not that the screenshots are for illustration purposes and taken from a different dataset to that being analysed in the tutorial)
This is a multidimensional scaling plot which represents the variation between samples. It is a similar concept to a Principal Components Analysis (PCA) plot. The x-axis is the dimension with the highest magnitude. In a standard control/treatment setup, samples should be split along this axis. A desirable plot is shown below:-
Each dot shows the change in expression in one gene.
Click on the dot to see the gene name.
Each line shows the change in expression in one gene, between control and treatment.
Table of genes
The table can be sorted according to any of the columns (e.g. fold-change or p-value)
Above the genes table is the option to download the results of the current analysis to a csv file. You can also download the R code required to reproduce the analysis by clicking the Show R code box underneath the Options box.
Plots such as the MDS, MA and heatmap can also be exported by right-clicking on the plot.
Question: Do the sample groupings in the MDS plot make sense? Do any samples appear to be mislabeled? What effect might this have on the analysis?
Question: Having identified the problem with the analysis, modify the configuration and repeat. How many genes are differentially expressed this time?
Comparing Basal vs Luminal wasn’t really the main question of interest in the dataset, but it serves to illustrate the importance of checking QC plots.
Take some time to understand the various parts of the report
Exercise: Make sure the FDR cut-off and abs LogFC
cutoffs are set to default and download the file and rename to
background.csv. We will use this later.
Exercise: How many genes are
differentially-expressed with an FDR < 0.05 and abs logFC > 1.
Download this file and rename it to
B.preg_vs_lactation.csv.
Exercise: Repeat the analysis for Luminal.Pregnant vs Luminal.Lactation and download the table of differentially-expressed results (same FDR and log fold-change).
If you didn’t manage to complete these analyses, you can download the files from here by right-clicking on each link and selecting “Save Link as” (or equivalent). They are also available in the course google drive.
We might sometimes want to compare the lists of genes that we identify using different methods, or genes identified from more than one contrast. In our example dataset we can compare the genes in the contrast of pregnant vs luminal in basal and luminal cells
The website venny provides a really nice interface for doing this.
In this section we will use the following files
background.csv containing
one row for each gene in the comparison Basal.pregnant vs
Basal.lactation (27,179 rows).B.preg_vs_lactation.csv
containing one row for each found to be DE in the contrast
Basal.pregnant vs Basal.Lactation.It will be helpful to have both these files open in Excel.
In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.
There are two different approaches one might use, and we will cover the theory behind both. The distinction is whether you are happy to use a hard (and arbitrary) threshold to identify DE genes.
“Threshold-based” methods require defintion of a statistical threshold to define list of genes to test (e.g. FDR < 0.01). Then a hypergeometric test or Fisher’s Exact test is generally used. These are typically used in situations where plenty of DE genes have been identified, and people often use quite relaxed criteria for identifying DE genes (e.g. raw rather than adjusted p-values or FDR value)
The question we are asking here is;
“Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?”
We can answer this question by knowing
The formula for Fishers exact test is;
\[ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \]
with:-
| Differentially Expressed | Not Differentially Expressed | Total | |
|---|---|---|---|
| In Gene Set | a | b | a + b |
| Not in Gene Set | c | d | c + d |
| Total | a + c | b +d | a + b + c + d (=n) |
In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- http://www.geneontology.org/
There are several popular online tools for performing enrichment analysis
We will be using the online tool GOrilla to perform the pathways analysis. It has two modes; the first of which accepts a list of background and target genes.
Mus MusculusTwo unranked lists of genesProcessSearch Enriched GO termsYou should be presented with a graph of enriched GO terms showing the relationship between the terms. Each GO term is coloured according to its statistical significance.
Below the figure is the results table. This links to more information about each GO term, and lists each gene in the category that was found in your list. The enrichment column gives 4 numbers that are used to determine enrichment (similar to the Fisher exact test we saw earlier)
Exercise: Use GOrilla to find enriched pathways in the Basal pregnant vs lactation analysis
This type of analysis is popular for datasets where differential expression analysis does not reveal many genes that are differentially-expressed on their own. Instead, it seeks to identify genes that as a group have a tendancy to be near the extremes of the log-fold changes. The results are typically presented in the following way.
The “barcode”-like panel represents where genes from a particular pathway (HALLMARK_E2F_TARGETS in this case) are located in a gene list ranked from most up-regulated to most down-regulated. The peak in the green curve is used to indicate where the majority of genes are located. If this is shifted to the left or the right it indicates that genes belonging to this gene set have a tendancy to be up- or down-regulated.
As such, it does not rely on having to impose arbitrary cut-offs on the data. Instead, we need to provide a measure of the importance of each gene such as it’s fold-change. These are then used the rank the genes.
The Broad institute has made this analysis method popular and provides a version of GSEA that can be run via a java application. However, the application can be a bit fiddly to run, so we will use the GeneTrail website instead
https://genetrail.bioinf.uni-sb.de/start.html
background.csv in Excel and delete all
columns except the SYMBOL and basal.lactation
column. Hopefully it should recognise your input without any errors, and on the next screen the Set-level statistic should be automatically set to GSEA
If your data does not get uploaded, double-check that the column heading basal.lactation has not been pasted into the text box
To make the analysis run faster, you can de-select the GO pathways (biological processes, molecular function and cellular compartment)
After a short wait, you will be able to view and download the results. The tested pathways are grouped into different sources (Kegg, Reactome or Wikipathways)
Each of the significant pathways can be explored in detail by clicking the More.. link; such as showing which genes in that pathways are up- or downregulated.
The Rank of the gene shown is the position of the gene in the ranked list; with 1 being most up-regulated gene. The score is the score used to rank the genes (fold-change in our example).
Exercise: Use GeneTrail to analyse